Clustering metrics - alternatives to the elbow method¶

Dr. Tirthajyoti Sarkar, Fremont, CA 94536¶

Clustering is an important part of the machine learning pipeline for business or scientific enterprises utilizing data science. As the name suggests, it helps to identify congregations of closely related (by some measure of distance) data points in a blob of data which would otherwise be difficult to make sense of.

A popular method like k-means clustering does not seem to provide a completely satisfactory answer when we ask the basic question: 

"How would we know the actual number of clusters, to begin with?"

This question is critically important because clustering is often a precursor to further processing of the individual cluster data, and the amount of computational resource required may depend on this measurement.

In the case of a business analytics problem, the repercussions can be worse. Clustering is often done for such analytics with the goal of market segmentation. It is therefore easy to imagine that, depending on the number of clusters, appropriate marketing personnel will be allocated to the problem. Consequently, a wrong assessment of the number of clusters can lead to sub-optimal allocation of precious resources.

For the k-means clustering method, the most common approach to answering this question is the so-called elbow method. It involves running the algorithm in a loop with an increasing number of clusters, and then plotting a clustering score as a function of the number of clusters.
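The elbow loop can be sketched on synthetic data (a minimal sketch using scikit-learn's `make_blobs`; the sample counts and cluster parameters here are illustrative, not the notebook's dataset):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Synthetic data with a known number of well-separated clusters
X_demo, _ = make_blobs(n_samples=300, centers=4, random_state=0)

inertias = []
for k in range(2, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

# Inertia keeps falling as k grows; the "elbow" is where the drop flattens,
# which is what the plot-based method tries to locate by eye.
```

The weakness of the method is exactly this last step: the inertia curve decreases monotonically, so picking the "elbow" is a subjective visual judgment rather than a well-defined optimum.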

In this notebook, we show which metrics to use for visualizing and determining the optimal number of clusters, and why they work much better than the usual practice, the elbow method.


In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [ ]:
n_features = 30
n_cluster = 3
# cluster_std = 1.2
# n_samples = 200
In [ ]:
df1 = pd.read_pickle('PAT_3415_wref/PAT_3415_wref_2023-08-23.pkl')


y=df1['Class']
df1.drop('Class',inplace=True,axis=1)
df1
Out[ ]:
(DataFrame preview: 492 rows × 30 columns of numeric features, columns 0–29, indexed GA1 … HLG18)

In [ ]:
from itertools import combinations
In [ ]:
lst_vars=list(combinations(df1.columns,2))
In [ ]:
len(lst_vars)
Out[ ]:
435
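The count 435 is simply the number of ways to choose 2 of the 30 feature columns, which can be double-checked with the standard library:

```python
from math import comb

# 435 pairwise scatter plots = number of unordered pairs of the 30 columns
assert comb(30, 2) == 435
```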
In [ ]:
for k in range(1,15):
    plt.figure(figsize=(15,8))

    # 6x5 grid: plot the first 29 of the 435 column pairs,
    # coloring points by feature column k
    for i in range(1,30):
        plt.subplot(6,5,i)
        dim1,dim2 = lst_vars[i-1]
        # Three consecutive row slices drawn with different edge colors
        # (note: 163*3 = 489 of the 492 rows)
        plt.scatter(df1[dim1][0:163],df1[dim2][0:163],c=df1[k][0:163],edgecolor='green',s=10)
        plt.scatter(df1[dim1][163:326],df1[dim2][163:326],c=df1[k][163:326],edgecolor='blue',s=10)
        plt.scatter(df1[dim1][326:489],df1[dim2][326:489],c=df1[k][326:489],edgecolor='red',s=10)
        plt.xlabel(f"{dim1}",fontsize=13)
        plt.ylabel(f"{dim2}",fontsize=13)
    plt.tight_layout()
    plt.show()
In [ ]:
plt.figure(figsize=(15,8))
# Same pairwise grid as above, coloring every point by feature column 1
for i in range(1,30):
    plt.subplot(6,5,i)
    dim1,dim2 = lst_vars[i-1]
    plt.scatter(df1[dim1],df1[dim2],c=df1[1],edgecolor='green',s=20)
    plt.xlabel(f"{dim1}",fontsize=13)
    plt.ylabel(f"{dim2}",fontsize=13)
plt.tight_layout()
plt.show()

How are the classes separated? (boxplots)¶

In [ ]:
plt.figure(figsize=(16,14))
for i,c in enumerate(df1.columns):
    plt.subplot(6,5,i+1)
    # Group each feature by the class labels extracted earlier
    sns.boxplot(y=df1[c],x=y)
    plt.xticks(fontsize=15)
    plt.yticks(fontsize=15)
    plt.xlabel("Class",fontsize=15)
    plt.ylabel(c,fontsize=15)
plt.tight_layout()
plt.show()

k-means clustering¶

In [ ]:
from sklearn.cluster import KMeans

Unlabeled data¶

In [ ]:
X=df1
In [ ]:
X.tail()
Out[ ]:
(DataFrame preview: last 5 rows of X)

In [ ]:
# y=df1['Class']

Scaling¶

In [ ]:
from sklearn.preprocessing import MinMaxScaler
In [ ]:
scaler = MinMaxScaler()
In [ ]:
X_scaled=scaler.fit_transform(X)
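Min-max scaling maps every feature to [0, 1], which puts all 30 dimensions on the same footing for the distance-based cluster scores computed below. A minimal sketch on stand-in data (not the notebook's dataset):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Stand-in array: 3 samples, 2 features with very different ranges
demo = np.array([[1.0, -5.0],
                 [3.0,  0.0],
                 [5.0,  5.0]])
demo_scaled = MinMaxScaler().fit_transform(demo)

# After scaling, every column spans exactly [0, 1]
assert demo_scaled.min() == 0.0 and demo_scaled.max() == 1.0
```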

Metrics¶

In [ ]:
from sklearn.metrics import silhouette_score, davies_bouldin_score, v_measure_score
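Before looping, it helps to recall which direction each metric points: the silhouette coefficient lies in [-1, 1] and higher is better; the Davies-Bouldin index is non-negative and lower is better; the V-measure compares predicted clusters against known labels, so it needs ground truth. A quick sanity check on illustrative blobs (not the notebook's dataset):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

sil = silhouette_score(X_demo, labels)     # in [-1, 1]; higher is better
db = davies_bouldin_score(X_demo, labels)  # >= 0; lower is better
```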

Running k-means and computing inter-cluster distance score for various k values¶

In [ ]:
km_scores= []
km_silhouette = []
vmeasure_score =[]
db_score = []
for i in range(2,12):
    km = KMeans(n_clusters=i, n_init=10, random_state=0).fit(X_scaled)
    preds = km.predict(X_scaled)

    # km.score is the negative of the inertia, so negate it back for plotting
    print("Score for number of cluster(s) {}: {}".format(i,km.score(X_scaled)))
    km_scores.append(-km.score(X_scaled))

    # Silhouette: in [-1, 1], higher is better
    silhouette = silhouette_score(X_scaled,preds)
    km_silhouette.append(silhouette)
    print("Silhouette score for number of cluster(s) {}: {}".format(i,silhouette))

    # Davies-Bouldin: >= 0, lower is better
    db = davies_bouldin_score(X_scaled,preds)
    db_score.append(db)
    print("Davies Bouldin score for number of cluster(s) {}: {}".format(i,db))

    # V-measure compares the clustering against the ground-truth labels y
    v_measure = v_measure_score(y,preds)
    vmeasure_score.append(v_measure)
    print("V-measure score for number of cluster(s) {}: {}".format(i,v_measure))
    print("-"*100)
In [ ]:
plt.figure(figsize=(7,4))
plt.title("The elbow method for determining number of clusters\n",fontsize=16)
plt.scatter(x=[i for i in range(2,12)],y=km_scores,s=150,edgecolor='k')
plt.grid(True)
plt.xlabel("Number of clusters",fontsize=14)
plt.ylabel("K-means score",fontsize=15)
plt.xticks([i for i in range(2,12)],fontsize=14)
plt.yticks(fontsize=15)
plt.show()
In [ ]:
plt.scatter(x=[i for i in range(2,12)],y=vmeasure_score,s=150,edgecolor='k')
plt.grid(True)
plt.xlabel("Number of clusters")
plt.ylabel("V-measure score")
plt.show()
In [ ]:
plt.figure(figsize=(7,4))
plt.title("The silhouette coefficient method \nfor determining number of clusters\n",fontsize=16)
plt.scatter(x=[i for i in range(2,12)],y=km_silhouette,s=150,edgecolor='k')
plt.grid(True)
plt.xlabel("Number of clusters",fontsize=14)
plt.ylabel("Silhouette score",fontsize=15)
plt.xticks([i for i in range(2,12)],fontsize=14)
plt.yticks(fontsize=15)
plt.show()
In [ ]:
plt.scatter(x=[i for i in range(2,12)],y=db_score,s=150,edgecolor='k')
plt.grid(True)
plt.xlabel("Number of clusters")
plt.ylabel("Davies-Bouldin score")
plt.show()

Expectation-maximization (Gaussian Mixture Model)¶

In [ ]:
from sklearn.mixture import GaussianMixture
In [ ]:
gm_bic= []
gm_score=[]
for i in range(2,12):
    gm = GaussianMixture(n_components=i,n_init=10,tol=1e-3,max_iter=1000).fit(X_scaled)
    print("BIC for number of cluster(s) {}: {}".format(i,gm.bic(X_scaled)))
    print("Log-likelihood score for number of cluster(s) {}: {}".format(i,gm.score(X_scaled)))
    print("-"*100)
    gm_bic.append(-gm.bic(X_scaled))  # negate so that higher = better fit
    gm_score.append(gm.score(X_scaled))  # mean per-sample log-likelihood
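Unlike inertia, BIC penalizes model complexity, so it can attain a genuine minimum at the right component count instead of improving monotonically. A minimal sketch on illustrative well-separated blobs (parameters are stand-ins, not the notebook's data):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Three well-separated synthetic components
X_demo, _ = make_blobs(n_samples=400, centers=3, cluster_std=0.8, random_state=42)

bics = {}
for k in range(1, 7):
    gm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X_demo)
    bics[k] = gm.bic(X_demo)  # lower BIC = better fit/complexity trade-off

best_k = min(bics, key=bics.get)
```

Because the fit-versus-complexity trade-off is explicit, `best_k` can be read off programmatically rather than judged by eye from a plot.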
In [ ]:
plt.figure(figsize=(7,4))
plt.title("The Gaussian Mixture model BIC \nfor determining number of clusters\n",fontsize=16)
plt.scatter(x=[i for i in range(2,12)],y=np.log(gm_bic),s=150,edgecolor='k')
plt.grid(True)
plt.xlabel("Number of clusters",fontsize=14)
plt.ylabel("Log of Gaussian mixture BIC score",fontsize=15)
plt.xticks([i for i in range(2,12)],fontsize=14)
plt.yticks(fontsize=15)
plt.show()
In [ ]:
plt.scatter(x=[i for i in range(2,12)],y=gm_score,s=150,edgecolor='k')
plt.grid(True)
plt.xlabel("Number of clusters")
plt.ylabel("Log-likelihood score")
plt.show()